First loading all possible libraries that I might use.
Notice that the parameter “echo” was set to FALSE for this code chunk. This prevents the code from displaying in the knitted HTML output. You should set echo=FALSE for all code chunks in your file, unless it makes sense for your report to show the code that generated a particular plot.
The other parameters for “message” and “warning” should also be set to FALSE for other code chunks once you have verified that each plot comes out as you want it to. This will clean up the flow of your report.
Importing the data
For the purpose of this project I have chosen to work with Prosper Loan Data set. I hope to find interesting insight in this domain(loan/credit system) for example who are the suitable customers eligible for a loan, has there been any fraud so cases like these. So to get started first we load the dataset.
## [1] 113937 81
So after loading the data we find that there are 113937 rows/observations and 81 variables/columns.
Now there is a loan_data_dict file which contains the information about each variable. For the purpose of this project I am choosing 36 variables which I think is important for exploration purposes. So next step is to subset the data
## [1] "ListingKey"
## [2] "ListingNumber"
## [3] "ListingCreationDate"
## [4] "CreditGrade"
## [5] "Term"
## [6] "LoanStatus"
## [7] "ClosedDate"
## [8] "BorrowerAPR"
## [9] "BorrowerRate"
## [10] "LenderYield"
## [11] "EstimatedEffectiveYield"
## [12] "EstimatedLoss"
## [13] "EstimatedReturn"
## [14] "ProsperRating (numeric)"
## [15] "ProsperRating (Alpha)"
## [16] "ProsperScore"
## [17] "ListingCategory (numeric)"
## [18] "BorrowerState"
## [19] "Occupation"
## [20] "EmploymentStatus"
## [21] "EmploymentStatusDuration"
## [22] "IsBorrowerHomeowner"
## [23] "CurrentlyInGroup"
## [24] "GroupKey"
## [25] "DateCreditPulled"
## [26] "CreditScoreRangeLower"
## [27] "CreditScoreRangeUpper"
## [28] "FirstRecordedCreditLine"
## [29] "CurrentCreditLines"
## [30] "OpenCreditLines"
## [31] "TotalCreditLinespast7years"
## [32] "OpenRevolvingAccounts"
## [33] "OpenRevolvingMonthlyPayment"
## [34] "InquiriesLast6Months"
## [35] "TotalInquiries"
## [36] "CurrentDelinquencies"
## [37] "AmountDelinquent"
## [38] "DelinquenciesLast7Years"
## [39] "PublicRecordsLast10Years"
## [40] "PublicRecordsLast12Months"
## [41] "RevolvingCreditBalance"
## [42] "BankcardUtilization"
## [43] "AvailableBankcardCredit"
## [44] "TotalTrades"
## [45] "TradesNeverDelinquent (percentage)"
## [46] "TradesOpenedLast6Months"
## [47] "DebtToIncomeRatio"
## [48] "IncomeRange"
## [49] "IncomeVerifiable"
## [50] "StatedMonthlyIncome"
## [51] "LoanKey"
## [52] "TotalProsperLoans"
## [53] "TotalProsperPaymentsBilled"
## [54] "OnTimeProsperPayments"
## [55] "ProsperPaymentsLessThanOneMonthLate"
## [56] "ProsperPaymentsOneMonthPlusLate"
## [57] "ProsperPrincipalBorrowed"
## [58] "ProsperPrincipalOutstanding"
## [59] "ScorexChangeAtTimeOfListing"
## [60] "LoanCurrentDaysDelinquent"
## [61] "LoanFirstDefaultedCycleNumber"
## [62] "LoanMonthsSinceOrigination"
## [63] "LoanNumber"
## [64] "LoanOriginalAmount"
## [65] "LoanOriginationDate"
## [66] "LoanOriginationQuarter"
## [67] "MemberKey"
## [68] "MonthlyLoanPayment"
## [69] "LP_CustomerPayments"
## [70] "LP_CustomerPrincipalPayments"
## [71] "LP_InterestandFees"
## [72] "LP_ServiceFees"
## [73] "LP_CollectionFees"
## [74] "LP_GrossPrincipalLoss"
## [75] "LP_NetPrincipalLoss"
## [76] "LP_NonPrincipalRecoverypayments"
## [77] "PercentFunded"
## [78] "Recommendations"
## [79] "InvestmentFromFriendsCount"
## [80] "InvestmentFromFriendsAmount"
## [81] "Investors"
# Number of Unique Records
distinct_loan_records <- length(unique(prosperLoans.sub$ListingNumber))
distinct_loan_records
## [1] 113066
# After subsetting the data, I wanted to know the unique number of records. I took ListingNumber as the field to check unique customers as it was a parameter which identified a loan borrower. But surprisingly I found that there are 11366 distinct records for a user which means that there may be duplicate/updated data for some users in our data set, Since the total number of rows are 113937.
# Number of Duplicate Records
duplicate_loan_records <- length(unique(prosperLoans.sub$ListingNumber
[duplicated(prosperLoans.sub$ListingNumber)]))
#Saving the duplicate records in a new variable
duplicate_loans <- unique(prosperLoans.sub$ListingNumber
[duplicated(prosperLoans.sub$ListingNumber)])
duplicate_loans
## [1] 1023355 1055932 1097584 927246 920397 894748 1066466 875616
## [9] 969821 970570 1047993 1138422 1100572 1057901 1101199 898052
## [17] 890514 1069928 1063352 900913 1116810 877625 1114591 925803
## [25] 1027003 1111486 1138731 1077775 898757 990994 1044353 951186
## [33] 1090596 926217 1116297 953329 1025631 877117 985290 919724
## [41] 958819 1097279 945736 1059427 1015904 1202163 928642 1102061
## [49] 1110415 1050418 1031431 1184734 891025 926813 1035457 906320
## [57] 989195 1116724 1149692 1071533 1017921 940961 1119836 889018
## [65] 1160473 1177596 1176594 1055942 997199 966176 902051 1053715
## [73] 1134330 1031508 875651 1118868 908353 960603 930618 956743
## [81] 1233732 1025541 918278 919623 844220 877719 1089145 1087743
## [89] 1024510 1017753 936442 882888 901662 946929 1204344 974533
## [97] 1200566 1055661 998257 879829 941296 944122 1190614 1190590
## [105] 931467 701942 876423 1166576 1131625 1136886 961598 1082906
## [113] 1092437 1052481 1167913 1138904 1037880 1237178 1189349 864114
## [121] 875249 1098094 1114520 1050318 1096149 932440 1151968 902157
## [129] 988904 1095113 1076428 1039105 1113064 1093115 880578 901650
## [137] 1021236 1080955 1079195 964692 1030511 855637 958929 990936
## [145] 875737 1148765 1030162 1086421 1044620 1124382 1014086 1021622
## [153] 1018488 1031370 879606 1123619 1102731 968180 1026732 857826
## [161] 1054113 1115327 1158389 1010477 1155587 1045723 1074971 1148799
## [169] 1121686 895686 1026555 990568 1143894 1165181 950042 990877
## [177] 1134181 1138460 1131123 1112691 1177424 1199490 1120459 1094962
## [185] 1028087 923713 1055568 949426 1056749 1151314 899615 956166
## [193] 1088191 908557 1094490 970565 915278 1170591 998232 1067478
## [201] 1103195 1054672 893227 1126419 989749 1119198 1136469 968319
## [209] 1125972 1154037 893464 812537 1087994 1108543 908028 914025
## [217] 1035191 1088101 930295 932593 851219 926425 1124568 1020751
## [225] 853665 1119803 987075 1121952 981333 1056941 1109263 1155979
## [233] 953464 892845 948269 1099457 1195343 1018930 970538 1104872
## [241] 1054732 1061584 960299 1154254 930842 1064438 986199 899140
## [249] 1087229 954927 1230428 990440 1122473 925162 905592 974745
## [257] 899828 1102482 836614 1169996 961245 1235007 1095372 1046027
## [265] 924066 1135450 931227 1107213 996510 1193829 925931 1137605
## [273] 1091297 1090653 924615 940142 1219068 1146677 930235 924637
## [281] 1108628 1072695 1018279 1080937 1109558 1026097 1212125 1006866
## [289] 1035245 905480 1015023 1080130 971971 1101080 1063589 972477
## [297] 1055627 881558 895558 1014296 1156309 1105838 1114920 959753
## [305] 1105397 1193025 894308 1153930 1133065 1108921 1005992 908476
## [313] 1124731 843648 1104667 1216695 1187057 866202 885990 924628
## [321] 887810 875666 1120935 911586 955360 1087132 895459 1075237
## [329] 1083058 950607 1105211 1091779 1226289 1188772 984135 1012949
## [337] 1038363 1198515 985955 1171593 1152960 908000 918514 1101815
## [345] 978521 1066485 950953 899188 1055937 941065 1096697 938171
## [353] 1129082 1087480 1093077 1060452 1190639 996453 951647 1189662
## [361] 917286 861876 1035339 1127013 902703 1177661 1008145 1142914
## [369] 929177 890186 1105533 1079945 889172 1052711 1134173 941636
## [377] 1032488 1027633 1151090 1220787 1040968 929942 919601 1192167
## [385] 1128993 1063482 977573 966050 1048472 944577 897127 898738
## [393] 1058741 990861 858364 976387 1173207 900683 1050774 1027971
## [401] 935199 877050 1166367 948342 1030359 921976 1127419 1204854
## [409] 1034092 927080 1198260 1142723 1099029 957621 1155578 890626
## [417] 1032477 1116107 995810 1099966 1048880 945525 1053022 997526
## [425] 977465 1173194 1167951 909248 1033614 1078459 1002118 1049543
## [433] 1133843 1026058 1105032 873019 1035968 1149519 1070208 1089453
## [441] 858972 979693 912341 1175222 934674 942184 951130 1040872
## [449] 1139308 1015624 936523 1042913 1237755 1138811 1012357 1068698
## [457] 910870 910395 920632 907826 1195494 1193526 934888 1104995
## [465] 1083653 906461 982609 1075677 1108589 1044047 918397 1030503
## [473] 1129110 976849 933684 1071984 972553 1132599 968628 1161127
## [481] 917385 1137949 1181090 1009580 1163161 928893 1056857 963475
## [489] 904644 1032668 1035436 1022216 944328 1106262 1018790 918706
## [497] 1033746 1147337 1159676 1026847 910144 1106190 916824 1118334
## [505] 935390 1086699 1139790 990978 975021 1153116 1098485 1171773
## [513] 900183 1042284 1135711 955723 1036715 1115376 1075423 859687
## [521] 1107254 935509 965062 973472 891204 1024450 1052645 870435
## [529] 948890 996569 1134224 1156730 1092244 1034392 1084783 815571
## [537] 945349 1182462 999405 864112 1047212 925440 1187090 960750
## [545] 985786 941212 786407 870200 1199292 1145814 1218234 893972
## [553] 959341 1037620 1030987 1140662 1060954 891559 1082764 964890
## [561] 1042325 1018607 1202259 933612 822680 936550 1120346 943121
## [569] 1176642 891086 864367 1040426 920831 1119401 1124318 1148010
## [577] 915816 909549 1036380 991912 1104523 1074688 1208246 883224
## [585] 1125268 1135094 928970 1198955 1085063 1061117 981714 895539
## [593] 1017064 968296 1055833 918138 1168637 1098360 1143717 1029688
## [601] 966565 997774 1058244 997194 1039311 1191338 955717 1229526
## [609] 922934 964598 974060 917956 1022689 1224227 1169331 1094929
## [617] 1031134 954024 975596 901736 1035967 920137 1202616 1111002
## [625] 917584 929915 1020649 1144412 989379 1147593 900306 903152
## [633] 1220004 1049727 880347 892078 948758 1036697 954385 971814
## [641] 1092833 871221 1033925 1210766 1080072 1107812 1047291 1013531
## [649] 1144803 932783 876871 1077016 896302 1029573 1113053 1226520
## [657] 1011660 946122 1055666 1201145 1171682 1189679 934107 1001424
## [665] 1209237 1108405 937425 1014533 1150451 1095389 1033033 932443
## [673] 901899 1070052 1112653 955359 1081388 1126153 904364 1209105
## [681] 991895 1026283 983809 1212311 1224533 1115681 1069263 896913
## [689] 993074 1131669 859848 1100742 932923 1129327 1036638 963489
## [697] 1127948 887683 1187075 1035715 956707 884082 964344 1138016
## [705] 951488 904382 964381 900607 927487 1016680 880017 1042995
## [713] 1031588 1197627 1240650 1120250 984017 1230959 962425 1148810
## [721] 1051243 1028176 1022898 1185586 1187684 884368 929848 958521
## [729] 1094025 1051319 1012256 979235 899894 1029460 879634 1009856
## [737] 879245 903826 983041 941374 1196181 989166 1212242 1036020
## [745] 1092866 1194914 996641 980406 1024431 1159061 949677 970092
## [753] 923110 1159834 924418 949840 1125929 929393 1091627 1021028
## [761] 1034667 976026 1056808 1140825 1004014 960183 925275 941303
## [769] 1090829 1000151 1088912 998446 930906 1193964 943141 852758
## [777] 1031589 967122 1080320 897290 1078902 927999 900792 931671
## [785] 1130881 973282 985579 905260 1142120 884862 960170 1116581
## [793] 959620 935518 1090758 994090 1051425 1009196 905371 1049386
## [801] 1065219 1128831 1048512 996931 827053 916391 978862 1184373
## [809] 1188953 969859 880829 1054075 1221869 1092778 1060784 1088357
## [817] 1091759 955489 942150 1197011 983181 1163997 1071926 933395
## [825] 1158474 904020 1002326
# So we find 827 entries which have duplicate records. So for my analysis I needed to reject these customers as they would bring additional details in my observation which may not be very useful.
Univariate Plots Section
After the data cleaning process, now I can proceed with the analysis. First I am checking single variables and their behavior. So first lets check whats the summary of loan taken and its statistic by exploring the price range in which maximum loan is taken.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6100 8293 12000 35000
We can see that there is an outlier in the plot after 30000.
## 'data.frame': 112239 obs. of 33 variables:
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 587746 213551 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 54939 15374 ...
## $ Term : int 36 36 36 36 36 60 36 36 60 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 3 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 538 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ ProsperRating (numeric) : int NA 6 NA 6 3 5 2 4 4 NA ...
## $ ProsperRating (Alpha) : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 5 1 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 7 NA ...
## $ ListingCategory (numeric) : int 0 2 0 16 2 1 1 2 1 0 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 22 1 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 3 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 1 1 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 2 2 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 56 10 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 4 1 ...
## $ OpenRevolvingMonthlyPayment: num 24 389 0 115 220 1410 214 101 25 40 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 2 5 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 1 3 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 28 1 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 1 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.51 0.32 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 47 7 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.12 0.27 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 6 2 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 13500 1000 ...
## $ Investors : int 258 1 41 158 20 1 1 1 19 53 ...
##
## Cancelled Chargedoff Completed
## 5 11992 38048
## Current Defaulted FinalPaymentInProgress
## 54928 5018 201
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 794 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 359 309 304
##
## $0 $1-24,999 $100,000+ $25,000-49,999 $50,000-74,999
## 621 7209 17046 31703 30463
## $75,000-99,999 Not displayed Not employed
## 16650 7741 806
Now lets see the distribution of loan status across income. This will be interesting
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 79 rows containing non-finite values (stat_bin).
## 'data.frame': 112239 obs. of 33 variables:
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 587746 213551 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 54939 15374 ...
## $ Term : int 36 36 36 36 36 60 36 36 60 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 3 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 538 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ ProsperRating (numeric) : int NA 6 NA 6 3 5 2 4 4 NA ...
## $ ProsperRating (Alpha) : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 5 1 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 7 NA ...
## $ ListingCategory (numeric) : int 0 2 0 16 2 1 1 2 1 0 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 22 1 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 3 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 1 1 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 2 2 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 56 10 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 4 1 ...
## $ OpenRevolvingMonthlyPayment: num 24 389 0 115 220 1410 214 101 25 40 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 2 5 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 1 3 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 28 1 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 1 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.51 0.32 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 47 7 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.12 0.27 ...
## $ IncomeRange : Ord.factor w/ 8 levels "$0"<"$1-24,999"<..: 3 4 8 3 6 6 3 3 5 2 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 13500 1000 ...
## $ Investors : int 258 1 41 158 20 1 1 1 19 53 ...
So now lets see the Employment status data.
##
## Employed Full-time Not available Not employed
## 2255 65912 26353 5347 835
## Other Part-time Retired Self-employed
## 3681 1088 795 5973
##
## Employed Full-time Not available Not employed
## 2255 65912 26353 5347 835
## Other Part-time Retired Self-employed
## 3681 1088 795 5973
Next I change the Listing Category numeric data to factor variable with designated values.
0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans
## Factor w/ 11 levels "Auto","Business",..: 6 4 6 1 4 3 3 4 3 6 ...
Debt to Income Ratio
A debt income ratio is the percentage of a consumer’s monthly gross income that goes toward paying debts. The data is capped at 10.01, debt-to-income ratio larger then 1000% will be returned as 1001%.
So lets take a look at the plot.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 8393 rows containing non-finite values (stat_bin).
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 797 rows containing non-finite values (stat_bin).
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8393
## [1] "A" "AA" "B" "C" "D" "E" "HR"
The most common rating is rating C, 17862. Only 5280 listing have AA rating or about 6.34% of the data. That’s very interesting so we get the idea that most loan takers have an average rating. No one pays on time then.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 36.00 36.00 42.44 60.00 60.00
Most loans have 36 months term.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 13.59 18.97 19.66 25.79 36.00
The median for the borrower rate is 18.97% and mean 19.66%. The maximum borrower rate is 36.00%.
What is/are the main feature(s) of interest in your dataset?
While there are many variables which i have chosen, the main features of the data are:
I chose these variables, because I think these are some imporatant factors while lending money.
What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
I think State abbreviation will play a part in knowing which state are vulnerable in Loan frauds. Also listing category will tell us about the loan category for which a person is borrowing money.
Did you create any new variables from existing variables in the dataset?
Yes, ListingCategoryNum.new and EmploymentStatus.char.
Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
The distribution i plotted mainly are positively skewed ie. DebtToIncomeRatio and LoanOriginalAmount . I did plot some graphs using faceting technique to get a hint of what’s going on in a factored variable. I changed Listing Category to a new varibale with simplified categories.
So now lets move to two variable analysis to find some interesting patterns
LoanOriginalAmount vs Investors
First lets take a look at LoanOriginalAmount vs Investors such that we can get an idea upto how much an ivester is willing to lend their money and how many invester line up for that amount.
## Warning: Removed 827 rows containing missing values (geom_point).
## `geom_smooth()` using method = 'gam'
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 1.0 30.0 110.1 202.0 1189.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.00 35.00 48.45 73.00 558.00
LoanOriginalAmount vs BorrowerRate
So next we analyze the LoanOriginalAmount vs BorrowerRate
## Warning: Removed 648 rows containing non-finite values (stat_smooth).
## Warning: Removed 1970 rows containing missing values (geom_point).
## [1] -0.413694
LoanOriginalAmount vs AnnualIncome
Now lets have a look at loan amount vs the income range. We will get to know which income range takes loan of upto 10000 dollars.
## Warning: Removed 406 rows containing non-finite values (stat_smooth).
## Warning: Removed 803 rows containing missing values (geom_point).
Most of the Loan are below 10000 and monthly income is under 10000. The quantile shows that the higher the monthly income the higher the median of the loan original amount.
The number of the data that have original amount < 10000 and annual income less then 100000 is:
## [1] 44656
which is around 53.7% of the data.
It seems that people who borrow > 25000 has monthly income of >= 10000 looks like there some kind of rule, that if you borrow > 25000 the the minimal monthly income should be at least 10000.
Let’s verify this a bit.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8333 9583 11667 13328 14583 66667
And yes the minimum monhtly income is 8333 for borrowing that amount of money. Let’s also check the correlation between the 2 variables.
## [1] 0.1813978
This relationship is quite weak, so we can’t be sure about any rules that for taking a loan your income should be a in a particular range.
LoanOriginalAmount vs ListingCategoryNum.new
Now lets have a look at listing category and amount of loan taken
## Warning: Removed 648 rows containing non-finite values (stat_boxplot).
So we see that the highest amount of loan takers come from the debt consolidation area. That will only increase their burden on the loan.
LoanOriginalAmount vs EmploymentStatus
Now we take a look at the employment status of the folks who take loans we have already seen this in Univariate analysis but lets have a more in depth analysis using box plots.
## Warning: Removed 648 rows containing non-finite values (stat_boxplot).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 3000 4000 5272 6750 25000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 1500 2500 3364 4000 15500
Interesting to see that not employed folks are requesting loan higher then part time. The median and the mean of not employed borrower are 4000 and 5272 vs 2500 and 3364 from part time borrower. Max loans for unemployed is 25000 and it is also greater than max loan for part-time borrower which is 15500
StatedMonthlyIncome vs BorrowerRate
Now lets take a look at the borrower rate and Monthly Income
## Warning: Removed 714 rows containing missing values (geom_point).
We can see there is really not much of a relation, only thing we notice is that mainly people who borrow loan are earn below 15000 dollars. And I can see the more the income less is your interest rate let us check if there is any negative correlation between them.
## [1] -0.09317254
Okay so there is a negative correlation of -0.093 but this is very small, so we can’t really say anything.
Let’s see the relationship of the Loan Amount with other variables.
So we can see that it is the region between (0-15000) dollars which is the most active and as we can tell BankcardUtilization, Delinquencies, PublicRecordsLast10Years and RevolvingCreditBalance all these variables usually a loan taker in the above range has a certain record related to these which would hamper their prosper rating.
So it seems we should be careful while lending money below 15000 dollars at the least.
ProsperRatingAlpha vs LoanOriginalAmount
Let’s look into prosper rating also.
## Warning: Removed 648 rows containing non-finite values (stat_boxplot).
Lets take a look at the summary by LoanOriginalAmount
## prosperLoans.sub$ProsperRatingAlpha: AA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 5700 10500 11547 16000 35000
## --------------------------------------------------------
## prosperLoans.sub$ProsperRatingAlpha: A
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 5750 10000 11434 15000 35000
## --------------------------------------------------------
## prosperLoans.sub$ProsperRatingAlpha: B
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 6000 10000 11580 15000 35000
## --------------------------------------------------------
## prosperLoans.sub$ProsperRatingAlpha: C
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 5000 10000 10356 15000 25000
## --------------------------------------------------------
## prosperLoans.sub$ProsperRatingAlpha: D
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6000 7065 10000 15000
## --------------------------------------------------------
## prosperLoans.sub$ProsperRatingAlpha: E
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 3500 4000 4564 5000 15900
## --------------------------------------------------------
## prosperLoans.sub$ProsperRatingAlpha: HR
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 3000 4000 3462 4000 16800
So the maximum loan taken is 35000 dollars and it has got the highest prosper rating too AA type. The median loan value of 4000 dollars lead to a prosper rating of ‘E’ and ‘HR’.
I don’t understand why people with such low loan amount cannot pay thier debts in time.
Lets check only ‘E’ and ‘HR’ type rating with the Listing category.
So we can see that these folks are generally students and it is as expected. However there are 2 more category which has a large amount of ‘E’ and “HR” prosper rating those are Auto and Other types.
Note: Auto means those who have taken loan for automobile of any kind.
ProsperRatingAlpha vs Term
Now lets look at prosper rating and Loan term
## [1] "12" "36" "60"
Take a look at the summary of each prosper rating with respect to Term
## prosperLoans.sub$ProsperRatingAlpha: AA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 36.00 36.00 38.34 36.00 60.00
## --------------------------------------------------------
## prosperLoans.sub$ProsperRatingAlpha: A
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 36.00 36.00 41.56 60.00 60.00
## --------------------------------------------------------
## prosperLoans.sub$ProsperRatingAlpha: B
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 36.00 36.00 45.11 60.00 60.00
## --------------------------------------------------------
## prosperLoans.sub$ProsperRatingAlpha: C
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 36.00 36.00 45.96 60.00 60.00
## --------------------------------------------------------
## prosperLoans.sub$ProsperRatingAlpha: D
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 36.00 36.00 42.43 60.00 60.00
## --------------------------------------------------------
## prosperLoans.sub$ProsperRatingAlpha: E
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 36.00 36.00 39.88 36.00 60.00
## --------------------------------------------------------
## prosperLoans.sub$ProsperRatingAlpha: HR
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 36 36 36 36 36 36
So mostly people take a term of 36 months or 3 years.
Now let’s look at the 3 categories we found earlier in Lending category and see their term, borrower rate,TotalInquiries, DelinquenciesLast7Years, PublicRecordsLast10Years
SO we can clearly see people who take loans for Auto purposes are the most risky ones.
Now let’s look at the correlation of Prosper rating with some variables and find out which factors are more important.
Correlation between BorrowerRate with Prosper Rating:
## [1] -0.9531032
Correlation between Prosper rating with RevolvingCreditBalance:
## [1] 0.06221151
Correlation between Prosper rating with BankcardUtilization:
## [1] -0.2658791
Correlation between Prosper rating with Term:
## [1] 0.08213252
Correlation between Prosper rating with OpenRevolvingAccounts:
## [1] 0.1241319
Correlation between Prosper rating with DebtToIncomeRatio:
## [1] -0.1353105
Correlation between Prosper rating with LoanOriginalAmount:
## [1] 0.4298129
So prosper rating is higly dependent on Borrower rate. They have a negative correlation of -0.9531032.
2nd is Bank card utilization with a negative correlation of -0.2658791.
3rd is Open revolving accounts with a positive correlation of 0.1241319.
Finally Prosper rating is positively correlated with loan amount having pearson’s R of 0.4298129
So Borrower rate and loan amount is the thing to watch here along with the other 2 parameters.
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
I wanted to see how several features affect the LoanOriginalAmount. So i took help of Investors, BorrowerRate, StatedMonthlyIncome, ListingCategoryNum.new, EmploymentStatus and ProsperRatingAlpha.
For almost every case i used 99 % of the data removin any outliers from my investigation.
I found that the maximum loan taken is 35000 dollars and it has got the highest prosper rating too AA type. The highest amount of loan takers come from the debt consolidation area. I observed that students, Auto and Other types listing category fill up the ‘E’ and “HR” prosper rating.
SO we can clearly see people who take loans for Auto purposes are the most risky ones.
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
No.
What was the strongest relationship you found?
So i found that mainly 3 types of listing category ie. Auto, Other and Students are the most vulnerable to late payments or being a defaulter. On the basis of this I plotted some boxplot for these 3 categories and other variables such as Term, BorrowerRate, DelinquenciesLast7Years, PublicRecordsLast10Years and TotalInquiries so that we can find out some data having some bad records to their name. We find that mainly there are records for Auto and Other categories in these plots. So for these 2 listing category the money lenders should take precaution and check the background of the person to whom they are lending money.
I also plotted ProsperRatingAlpha vs Term but for Term it is not affected by the rating, the most common terms is 36 across the ratings.
For LoanOriginalAmount vs BorrowerRate We find out that as loan amount increases the rate of interest decreases having pearson’s R as -0.413694.
Finally i found out correlation between ProsperRatingNum and various variables and found out prosper rating is higly dependent on Borrower rate with a negative correlation of -0.9531032 and is positively correlated with loan amount having pearson’s R of 0.4298129.
So these are some of the strongest relationships i have found.
So we see that prosper rating has a positive correlation for loan amount and Open revolving accounts. So let us first plot these variables with Monthly income
I am transforming Monthly income to yearly income to see the trend
options(scipen = 999)
ggplot(aes(x = StatedMonthlyIncome * 12, y = LoanOriginalAmount),
data = subset(prosperLoans.sub,
StatedMonthlyIncome*12 <
quantile(StatedMonthlyIncome*12, 0.99, na.rm = TRUE))) +
geom_point(aes(color=ProsperRatingAlpha), alpha = 0.5,
size = 1, position = 'jitter') +
scale_x_log10() +
scale_color_brewer(type="div", palette = 3) +
xlab("Yearly Income")
## Warning: Transformation introduced infinite values in continuous x-axis
We see that those who took a loan amount above 25000 dollars are in AA rating and as the loan amount increases the rating increases. This proves our correlation. This was wierd at first but then i recalled the plot i made earlier from which i got that mostly people with low prosper rating are students, Auto and Other types.
Now next we see which ListingCategory has most of the loans which correspond to their low prosper rating.
ggplot(aes(x = StatedMonthlyIncome * 12, y = LoanOriginalAmount),
data = subset(prosperLoans.sub,
(StatedMonthlyIncome*12 <
quantile(StatedMonthlyIncome*12, 0.99, na.rm = TRUE)) &
ListingCategoryNum.new == "Auto" |
ListingCategoryNum.new == "Student" |
ListingCategoryNum.new == "Other/NA")) +
geom_point(aes(color=ListingCategoryNum.new), alpha = 0.5, size = 1, position = 'jitter') +
scale_x_log10() +
scale_color_brewer(type="fill", palette = "PuBu")+
xlab("Yearly Income")
## Warning: Transformation introduced infinite values in continuous x-axis
So we see that most of the loans below 25000 dollars comprise of Other or Auto Listing Category.
Alright let’s see the plot of OPen revolving accounts vs TotalCreditLinespast7years
ggplot(aes(x = OpenRevolvingAccounts, y = TotalCreditLinespast7years),
data = prosperLoans.sub) +
geom_point(aes(color=ProsperRatingAlpha), alpha = 0.5,
size = 1, position = 'jitter') +
scale_x_continuous() +
scale_color_brewer(type="div", palette = 3)
So I wanted to see how many credit profiles a person had and their open revolving accounts. It seemed to me that they both are positively correlated and isn’t it obvious more the accounts more credit lines a person has.
However I could not make anything out of the prosper rating, it looked very mixed up.
Now let’s look at loan amount vs AmountDelinquent
ggplot(aes(x = LoanOriginalAmount, y = AmountDelinquent),
data = subset(prosperLoans.sub,
(AmountDelinquent <
quantile(AmountDelinquent, 0.99, na.rm = TRUE)))) +
geom_point(aes(color=ProsperRatingAlpha), alpha = 0.5,
size = 1, position = 'jitter') +
scale_x_continuous() +
scale_color_brewer(type="div", palette = 3)
As expected we see that people with less amount Delinquent fall into good Prosper rating. Most of the people who has large amount delinquent have low propser rating, however thier loans are not that much but still they don’t pay in time. That is some strange characteristic.
Now we check the PublicRecordsLast10Years vs Loan amount
ggplot(aes(x = LoanOriginalAmount, y = PublicRecordsLast10Years),
data = subset(prosperLoans.sub,
(PublicRecordsLast10Years < 10))) +
geom_point(aes(color=ProsperRatingAlpha), alpha = 0.5,
size = 1, position = 'jitter') +
scale_x_continuous() +
scale_color_brewer(type="div", palette = 3)
Here we notice that more the public record less is the prosper rating. But again people whose loan are less than 10000 dollars have more public records.
Now let’s see how many of them are students or in other listing category
ggplot(aes(x = LoanOriginalAmount, y = PublicRecordsLast10Years),
data = subset(prosperLoans.sub,
(PublicRecordsLast10Years < 10) &
(ListingCategoryNum.new == "Auto" |
ListingCategoryNum.new == "Student" |
ListingCategoryNum.new == "Other/NA"))) +
geom_point(aes(color=ListingCategoryNum.new), size = 1,
position = 'jitter') +
scale_x_continuous() +
scale_color_brewer(type="div", palette = 'RdBu' )
So from each plot i can see that those who take loan for Auto and Other have past records attached to them. Finally let’s look at the income range for these listing categories
ggplot(aes(x = IncomeRange, y = LoanOriginalAmount, fill = ListingCategoryNum.new),
data = subset(prosperLoans.sub,
ListingCategoryNum.new == "Auto" |
ListingCategoryNum.new == "Student" |
ListingCategoryNum.new == "Other/NA")) +
geom_boxplot()
Finally lets take out the summary of yearly income wrt to these particular listing category. From this we get an idea of how much these folks earn. By looking at the above plot it seems the Other category earns much more than the other two. I can’t understand what’s their problem in settling their loan amount in time. But in future if you want to lend someone money, one should check if they are particularly from these background and proceed with caution.
Finally getting back to my final analysis and after that I will model the data for prdicting Prosper rating
First let me make a new data frame which contains summarized values for these 3 listing categories just for reference.
listing_summary <- subset(prosperLoans.sub,
ListingCategoryNum.new == "Auto" |
ListingCategoryNum.new == "Student" |
ListingCategoryNum.new == "Other/NA") %>%
group_by(ListingCategoryNum.new) %>%
summarise (mean_Income = round(mean(StatedMonthlyIncome*12), digits = 0),
median_Income = round(median(StatedMonthlyIncome*12), digits = 0),
mean_loan = round(mean(LoanOriginalAmount), digits = 0),
max_loan = round(max(LoanOriginalAmount), digits = 0),
min_loan = round(min(LoanOriginalAmount), digits = 0),
number_of_people = n())
#round(x, digits = 0)
View(listing_summary)
Obviously we can see that most of the people approximately 10339 fall in Other/NA listing category and they are the ones who has highest earnings among the other 2 categories but still they have a bad name and do not pay loan in time or get defaulted.
Now, let’s do our model for predicting ProsperRating. I am going to add in the individual features for this.
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
##
## Attaching package: 'memisc'
## The following objects are masked from 'package:dplyr':
##
## collect, recode, rename
## The following object is masked from 'package:plyr':
##
## rename
## The following objects are masked from 'package:stats':
##
## contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
##
## as.array
m1 <- lm(ProsperRatingNum ~ LoanOriginalAmount, data = prosperLoans.sub)
m2 <- update(m1, ~ . + DebtToIncomeRatio)
m3 <- update(m2, ~ . + OpenRevolvingAccounts)
m4 <- update(m3, ~ . + TotalTrades)
m5 <- update(m4, ~ . + StatedMonthlyIncome)
m6 <- update(m5, ~ . + TotalInquiries)
m7 <- update(m6, ~ . + BorrowerRate)
m8 <- update(m7, ~ . + BankcardUtilization)
mtable(m1, m2, m3, m4, m5, m6, m7, m8, sdigits = 3)
##
## Calls:
## m1: lm(formula = ProsperRatingNum ~ LoanOriginalAmount, data = prosperLoans.sub)
## m2: lm(formula = ProsperRatingNum ~ LoanOriginalAmount + DebtToIncomeRatio,
## data = prosperLoans.sub)
## m3: lm(formula = ProsperRatingNum ~ LoanOriginalAmount + DebtToIncomeRatio +
## OpenRevolvingAccounts, data = prosperLoans.sub)
## m4: lm(formula = ProsperRatingNum ~ LoanOriginalAmount + DebtToIncomeRatio +
## OpenRevolvingAccounts + TotalTrades, data = prosperLoans.sub)
## m5: lm(formula = ProsperRatingNum ~ LoanOriginalAmount + DebtToIncomeRatio +
## OpenRevolvingAccounts + TotalTrades + StatedMonthlyIncome,
## data = prosperLoans.sub)
## m6: lm(formula = ProsperRatingNum ~ LoanOriginalAmount + DebtToIncomeRatio +
## OpenRevolvingAccounts + TotalTrades + StatedMonthlyIncome +
## TotalInquiries, data = prosperLoans.sub)
## m7: lm(formula = ProsperRatingNum ~ LoanOriginalAmount + DebtToIncomeRatio +
## OpenRevolvingAccounts + TotalTrades + StatedMonthlyIncome +
## TotalInquiries + BorrowerRate, data = prosperLoans.sub)
## m8: lm(formula = ProsperRatingNum ~ LoanOriginalAmount + DebtToIncomeRatio +
## OpenRevolvingAccounts + TotalTrades + StatedMonthlyIncome +
## TotalInquiries + BorrowerRate + BankcardUtilization, data = prosperLoans.sub)
##
## =========================================================================================================================
## m1 m2 m3 m4 m5 m6 m7 m8
## -------------------------------------------------------------------------------------------------------------------------
## (Intercept) 3.026*** 3.291*** 3.199*** 3.279*** 3.259*** 3.514*** 8.144*** 8.207***
## (0.009) (0.011) (0.013) (0.014) (0.015) (0.015) (0.008) (0.008)
## LoanOriginalAmount 0.000*** 0.000*** 0.000*** 0.000*** 0.000*** 0.000*** 0.000*** 0.000***
## (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000)
## DebtToIncomeRatio -0.662*** -0.691*** -0.688*** -0.670*** -0.704*** -0.081*** -0.077***
## (0.017) (0.017) (0.017) (0.017) (0.017) (0.006) (0.006)
## OpenRevolvingAccounts 0.016*** 0.026*** 0.026*** 0.027*** -0.006*** -0.005***
## (0.001) (0.001) (0.001) (0.001) (0.000) (0.000)
## TotalTrades -0.006*** -0.007*** -0.001 0.001*** 0.002***
## (0.001) (0.001) (0.001) (0.000) (0.000)
## StatedMonthlyIncome 0.000*** 0.000*** 0.000*** 0.000***
## (0.000) (0.000) (0.000) (0.000)
## TotalInquiries -0.088*** -0.011*** -0.014***
## (0.001) (0.000) (0.000)
## BorrowerRate -20.864*** -20.519***
## (0.027) (0.028)
## BankcardUtilization -0.262***
## (0.006)
## -------------------------------------------------------------------------------------------------------------------------
## R-squared 0.185 0.194 0.195 0.197 0.197 0.236 0.912 0.914
## adj. R-squared 0.185 0.193 0.195 0.197 0.197 0.236 0.912 0.914
## sigma 1.515 1.493 1.492 1.490 1.490 1.453 0.494 0.489
## F 18842.573 9119.598 6152.818 4653.560 3733.543 3920.892 112002.454 100520.274
## p 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -152519.030 -138351.932 -138263.518 -138200.794 -138179.222 -136277.381 -54311.232 -53430.710
## Deviance 190782.908 169525.245 169131.377 168852.504 168756.700 160520.678 18578.153 18152.727
## AIC 305044.061 276711.865 276537.037 276413.588 276372.443 272570.762 108640.464 106881.420
## BIC 305072.046 276748.820 276583.231 276469.020 276437.115 272644.672 108723.613 106973.808
## N 83155 76020 76020 76020 76020 76020 76020 76020
## =========================================================================================================================
The overall R-squared value is 0.914, which is very strong;Wow I am surprised that my model will be predicting the ProsperRating with this much accuracy. And as it looks BorrowerRate might be the most useful varable in our list for predicting Prosper Rating.
Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
Loans with higher ProsperRating have higher Loan amount borrowed and vice versa. Yearly Incmone is not a factor for Prosper Rating.
I focussed my investigation on the 3 listing categories I found from Bivariate analysis that have low prosper rating. And i found that most loans were taken by Other and Auto listing category and some going even higher than 20000 dollars.
Next i got that folks with low Amount Delinquent have higher Prosper Rating and vice versa, same is for number of Public Records. Most of the people who has large amount delinquent or more public records have low propser rating, however thier loans are not that much big amount but still they don’t pay in time.
So finally i see that those who take loan for Automobile and Other purposes which are Large Purchase or some green loans or cosmetic surgery are very vulnerable to untimely payments and have past records attached to them.
Were there any interesting or surprising interactions between features?
Yes there was in one of my final charts while investigating the 3 listing categories i found that those who take loans with Other Listing Category earns much more than the other two. I can’t understand their problem in settling their loan amount in time. So in future the bank or lender should check their background in these categories especially Auto and Other and borrow money.
OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.
Yes, I created a linear model with dependent variable as the ProsperRating and the predictors as the Loan Original Amount, DebtToIncomeRatio, OpenRevolvingAccounts, TotalTrades, StatedMonthlyIncome, TotalInquiries, BorrowerRate and BankcardUtilization.
The variables in the linear model account for 91.4% of the variance in the Prosper Rating. This means that there is a 91.4% chance that the model will hold true for predicting the Prosper Rating and that’s quite a model i created unknowingly. Mostly if you use this model you would get an accurate idea of Prosper Rating.
So I guess, the model’s limitation would be the huge amount of variables that I did not use for predicting ProsperRating. I only choose these 8 variables as per my above analysis. I don’t know how adding the other variables would effect my model.
So for my final plots I choose to see the data over a chloropleth. All this time I did my analyis build the model for predicting Prosper Rating. Now let’s use the Borrower state data to look at other insights from a birds eye view.
Okay now lets take a look at statewise distribution of mean loan amount.
So for that we need to use maps package and merge the state data with our data.
#x <- c("NY", "MA")
#state.name[match(x,state.abb)]
#Creating a new data frame with state data imported from maps package
#install.packages('maps')
library(maps)
##
## Attaching package: 'maps'
## The following object is masked from 'package:plyr':
##
## ozone
library(ggmap)
library(maptools)
## Loading required package: sp
## Checking rgeos availability: TRUE
states_map <- map_data("state")
#states_map
#Matching state data of peosper loan data set to state.abb(state abbreviations) and getting the state names
prosperLoans.sub$State <- state.name[match(prosperLoans.sub$BorrowerState, state.abb)]
#prosperLoans.sub$State
#Converting the state names to lower case for the join i need to apply
prosperLoans.sub$State <- sapply(prosperLoans.sub$State, tolower)
#prosperLoans.sub$State
#Summarizing the data and finding out mean loans for each state
df_state <- subset(prosperLoans.sub, !is.na(DebtToIncomeRatio)) %>% group_by(State) %>%
summarize(mean_loan = mean(LoanOriginalAmount),
Total_Inquiries = sum(TotalInquiries),
Total_Records = sum(PublicRecordsLast10Years),
no_of_investors = sum(Investors))
by(prosperLoans.sub$DebtToIncomeRatio, prosperLoans.sub$State, summary)
## prosperLoans.sub$State: alabama
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0200 0.1700 0.2500 0.2838 0.3400 7.2700 78
## --------------------------------------------------------
## prosperLoans.sub$State: alaska
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0300 0.1600 0.2400 0.2547 0.3425 0.6700 17
## --------------------------------------------------------
## prosperLoans.sub$State: arizona
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0200 0.1450 0.2100 0.2354 0.3000 1.2800 88
## --------------------------------------------------------
## prosperLoans.sub$State: arkansas
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0200 0.1800 0.2700 0.2995 0.3600 7.2700 65
## --------------------------------------------------------
## prosperLoans.sub$State: california
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.1300 0.2000 0.2268 0.2800 10.0100 995
## --------------------------------------------------------
## prosperLoans.sub$State: colorado
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0200 0.1600 0.2300 0.2713 0.3100 10.0100 165
## --------------------------------------------------------
## prosperLoans.sub$State: connecticut
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.1400 0.2100 0.2577 0.3000 10.0100 103
## --------------------------------------------------------
## prosperLoans.sub$State: delaware
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0200 0.1700 0.2300 0.2917 0.3200 10.0100 10
## --------------------------------------------------------
## prosperLoans.sub$State: florida
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.1600 0.2300 0.2666 0.3200 10.0100 557
## --------------------------------------------------------
## prosperLoans.sub$State: georgia
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0100 0.1600 0.2300 0.2701 0.3300 10.0100 290
## --------------------------------------------------------
## prosperLoans.sub$State: hawaii
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.1600 0.2300 0.2619 0.3350 2.2800 31
## --------------------------------------------------------
## prosperLoans.sub$State: idaho
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0200 0.1700 0.2600 0.2998 0.3500 6.4900 41
## --------------------------------------------------------
## prosperLoans.sub$State: illinois
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.1500 0.2200 0.2482 0.3100 10.0100 318
## --------------------------------------------------------
## prosperLoans.sub$State: indiana
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0200 0.1600 0.2400 0.2754 0.3200 10.0100 114
## --------------------------------------------------------
## prosperLoans.sub$State: kansas
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0200 0.1800 0.2500 0.2762 0.3500 1.6000 62
## --------------------------------------------------------
## prosperLoans.sub$State: kentucky
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0100 0.1800 0.2600 0.3037 0.3500 8.5200 63
## --------------------------------------------------------
## prosperLoans.sub$State: louisiana
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0200 0.1700 0.2650 0.2933 0.3900 1.1500 61
## --------------------------------------------------------
## prosperLoans.sub$State: maryland
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0100 0.1400 0.2100 0.2313 0.2900 2.4500 146
## --------------------------------------------------------
## prosperLoans.sub$State: massachusetts
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0100 0.1400 0.2000 0.2392 0.2900 10.0100 111
## --------------------------------------------------------
## prosperLoans.sub$State: michigan
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0100 0.1600 0.2300 0.2616 0.3300 5.3800 205
## --------------------------------------------------------
## prosperLoans.sub$State: minnesota
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0100 0.1600 0.2250 0.2532 0.3200 10.0100 148
## --------------------------------------------------------
## prosperLoans.sub$State: mississippi
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0300 0.1800 0.2600 0.2877 0.3700 1.1500 57
## --------------------------------------------------------
## prosperLoans.sub$State: missouri
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0200 0.1700 0.2400 0.2683 0.3300 6.0700 166
## --------------------------------------------------------
## prosperLoans.sub$State: montana
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0100 0.1700 0.2700 0.3111 0.3700 5.9800 25
## --------------------------------------------------------
## prosperLoans.sub$State: nebraska
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0400 0.1900 0.2700 0.2876 0.3600 1.1900 47
## --------------------------------------------------------
## prosperLoans.sub$State: nevada
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0300 0.1500 0.2200 0.2516 0.3100 10.0100 76
## --------------------------------------------------------
## prosperLoans.sub$State: new hampshire
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0100 0.1500 0.2300 0.2571 0.3200 4.4300 28
## --------------------------------------------------------
## prosperLoans.sub$State: new jersey
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.010 0.140 0.210 0.237 0.290 10.010 213
## --------------------------------------------------------
## prosperLoans.sub$State: new mexico
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0300 0.1700 0.2600 0.2707 0.3400 1.0800 35
## --------------------------------------------------------
## prosperLoans.sub$State: new york
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0100 0.1300 0.2000 0.2394 0.2900 10.0100 543
## --------------------------------------------------------
## prosperLoans.sub$State: north carolina
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0200 0.1600 0.2300 0.2681 0.3250 10.0100 202
## --------------------------------------------------------
## prosperLoans.sub$State: ohio
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0200 0.1700 0.2500 0.2837 0.3500 10.0100 253
## --------------------------------------------------------
## prosperLoans.sub$State: oklahoma
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0100 0.1700 0.2500 0.2764 0.3500 2.5300 55
## --------------------------------------------------------
## prosperLoans.sub$State: oregon
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.020 0.160 0.230 0.258 0.320 10.010 140
## --------------------------------------------------------
## prosperLoans.sub$State: pennsylvania
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0200 0.1600 0.2400 0.2651 0.3200 6.8500 213
## --------------------------------------------------------
## prosperLoans.sub$State: rhode island
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0400 0.1400 0.2100 0.2481 0.2900 6.6800 27
## --------------------------------------------------------
## prosperLoans.sub$State: south carolina
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.020 0.170 0.250 0.299 0.350 10.010 100
## --------------------------------------------------------
## prosperLoans.sub$State: south dakota
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.050 0.210 0.280 0.293 0.360 0.600 23
## --------------------------------------------------------
## prosperLoans.sub$State: tennessee
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0200 0.1700 0.2500 0.2913 0.3500 10.0100 121
## --------------------------------------------------------
## prosperLoans.sub$State: texas
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0100 0.1700 0.2400 0.2687 0.3400 4.6800 460
## --------------------------------------------------------
## prosperLoans.sub$State: utah
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.020 0.160 0.230 0.276 0.310 10.010 68
## --------------------------------------------------------
## prosperLoans.sub$State: vermont
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0300 0.1700 0.2700 0.2924 0.3800 1.2100 14
## --------------------------------------------------------
## prosperLoans.sub$State: virginia
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0100 0.1500 0.2200 0.2545 0.3200 10.0100 210
## --------------------------------------------------------
## prosperLoans.sub$State: washington
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.1500 0.2200 0.2566 0.3100 10.0100 200
## --------------------------------------------------------
## prosperLoans.sub$State: west virginia
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0300 0.1800 0.2600 0.2873 0.3600 4.1600 27
## --------------------------------------------------------
## prosperLoans.sub$State: wisconsin
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0200 0.1600 0.2300 0.2788 0.3300 10.0100 135
## --------------------------------------------------------
## prosperLoans.sub$State: wyoming
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0300 0.2400 0.3000 0.3237 0.3800 1.3900 14
str(df_state)
## Classes 'tbl_df', 'tbl' and 'data.frame': 48 obs. of 5 variables:
## $ State : chr "alabama" "alaska" "arizona" "arkansas" ...
## $ mean_loan : num 8660 10354 8886 8098 9847 ...
## $ Total_Inquiries: num 3953 711 7131 2331 53577 ...
## $ Total_Records : int 370 26 374 251 2457 507 403 75 1173 945 ...
## $ no_of_investors: int 67172 10498 77420 36916 719889 104363 97664 18222 348568 235391 ...
#Renaming the columns
#df_state <- rename(df_state, state = BorrowerState_new)
#making a new data frame which contains the centre coordinate of each state and its abbreviation
states <- data.frame(state.center, state.abb,state.name)
library(ggthemes)
# creating new data frame by merging states_map & df_state
new_map <- merge(states_map, df_state, by.x = "region", by.y = "State")
new_map <- arrange(new_map, group, order) # to sort polygons in right order
Plot One
## Warning: Ignoring unknown parameters: linestyle
This chloropleth shows the statewise distribution of mean loan amount. So from the map we see that folks living in California, Texas, New Mexico, Massachusetts, New Jersey and Virginia take huge amount of loans as compared to other states.
Plot Two
Now let’s see the number of investors of each state along with loan amount.
#First we need to make a new dataframe which contains the summarized value and the centers of each state
states$state.name <- sapply(states$state.name, tolower)
summarized_map <- merge(states, df_state, by.x = "state.name", by.y = "State")
new_map <- arrange(new_map, group, order) # to sort polygons in right order
## Warning: Ignoring unknown parameters: linestyle
We can see that California, Texas, New York, Illinois, Florida, Georgia, Ohio and Virginia are some main hubs for investors. Maybe these places are safe for lending your money or giving loans. These are the main states to look for while lending money sine there are so many investors in these places
Plot Three
Now lets look at the total number of Inquiries and Records for the States.
## Warning: Ignoring unknown parameters: linestyle
So in the last plot we are able to see which states have a bad reputation in settling loans. So we can see that California, Texas, Florida, Illinois, Virginia, New York, Ohia and Michigan have some records associated with Inquiries. So this suggests that people living in these states have been inquired previously for unsettling their loans or have a bad background.
The Prosper data set originally contains 113,937 records with 81 features and I was able to trim it down to 83,155 records; removing records with duplicate ListingKeys and those with NA and empty ProsperRating.
There are a lot of challenges which I faced with this project. One is not being familiar with Prosper, so going through their website and reading information on Prosper listing certainly helped, especially in understanding the different features of the Prosper loan.
Another challenge was choosing which feature/variable was the most interesting. The problem with Prosper is that it has many interesting features, so it is hard to choose which one is the most interesting, like ProsperScore, Borrower’s Rate, Loan Original Amount etc. Among all these Prosper Rating was the most important feature since it is a measurement of a Loan’s risk. So this made me interested on seeing the listing category which would have low Prosper Rating since they would be the most vulnerable in settling their loans.
The most difficult task I find in this study was choosing the 20 - 30 features out of 81. So I mainly choose the variable as per my instinct and which I was familiar with. So, I read about the features from the Prosper website and finally, I came up with 33 features. But after all my analysis I saw that mainly I used 20 - 22 features for my analysis.
While starting this project I had one thing in mind to find out type of people who make up risky loans. So i carried out my analysis accordingly. First I updated the listing category to fewer categories for my ease of use. Since all were numbered I renamed them to their original meaning so that i could understand what i am dealing with. And I made 11 categories out of it. And after Bivariate Analysis I got the idea that people with Listing Category Student, Other/NA and Auto are very risky. It was after Multivariate analysis that I concluded that loan for Automobile and Other purposes which include Large Purchase, green loans and cosmetic surgery are very vulnerable to untimely payments and have past records attached to them.
One thing i am happy about is my Linear Model Which I made since it has an acuuracy of 91.4%. I don’t know but by hit and trial i got the best variables for the model.
One thing i want to look at is a more improved way to look at the data over a map/chloropleth. In my final plots section I created map using State name abbreviation provided in the data. But there are some techniques in this which are unknown to me. In my third plot the legends are overlapping and I am yet to find a solution to that. This is a very tricky thing but i hope to find some solution to this problem. Coming back to data there are many unused variables. We can use other variables also, In the data there is information on borrower credit score. So it will be interesting to see what kind of relationship exists between credit score and ProsperRating.
[1] https://www.prosper.com/help/ [2] https://www.prosper.com/help/topics/how-to-read-a-loan-listing/ [3] www.google.com [4] https://s3.amazonaws.com/udacity-hosted-downloads/ud651/GeographyOfAmericanMusic.html [5] http://www.statmethods.net/ [6] http://www.cookbook-r.com/Manipulating_data/ [7] https://stackoverflow.com/questions/17723822/administrative-regions-map-of-a-country-with-ggmap-and-ggplot2